Searching Apparatus and Method

The present invention relates to search engines that access databases. The invention is
particularly but not exclusively related to systems that personalise a search engine by
creating a user profile.

An example of an application of the invention is to intranet search engines that access 
large databases such as large corporate repositories holding legal or medical data sets. It 
also applies to renewed data repositories such as news sources. The invention is typically 
integrated with a search platform utilised by users who access and search large 
unstructured databases such as intranets or the Internet. Such platforms may have 
several thousand users. 

The Intelligent Personalised Agent Framework, formerly known as Idioms, is disclosed in MP
Thint, B Crabtree, SJ Soltysiak, Adaptive personal agents, Personal Technologies
Journal, 2(3):141-151, 1998; B Crabtree, SJ Soltysiak, Knowing me, knowing you:
Practical issues in the personalisation of agent technology, In The PAAM'98 Third
International Conference on the Practical Application of Intelligent Agents and Multi-Agent
Technology, Practical Application Company, March 23-25 1998; and SJ Soltysiak,
Intelligent distributed information management systems, Technical report, BTexact
Technologies, IS Lab, 1999. This system acts as a host to a community of users and
provides them with on-line services including news sources or corporate databases. The
system offers the users a personalised experience. Within this system, users receive a
personalised newspaper every day using a search engine that has access to an
information source such as Intellact, disclosed in B Crabtree, SJ Soltysiak, Automatic
learning of user profiles - towards personalisation of agent services, BT Technology
Journal, 16(3):110-117, 1998.

I Koychev, Tracking changing user interests through prior-learning of context, In AH 2002,
2nd International Conference on Adaptive Hypermedia and Adaptive Web Based
Systems, 2002; and D Freitag, J McDermott, D Zabowski, T Mitchell, R Caruana,
Experience with a learning personal assistant, Communications of the ACM, 7(37):81 - 91,
disclose profile creation systems that are based on decision tree algorithms that
have input vectors with a number of features below thirty. In Koychev's approach the
application does not rely only on a window based approach; the algorithm also attempts to
freeze an interest in time and save it for future use. When a new interest is found it is
checked against "past interests" to see if it corresponds to an old interest. If it does, then
the application merges the old interest into the new one; this augments the new interest
with information that is relevant to it. The system enables advantageous learning
capabilities. Within the scope of Information Retrieval the number of features in a vector
is orders of magnitude larger: every keyword that has any relevance must be taken into
account and consequently the size of a vector rapidly reaches thousands of features.

In order to adapt user profiles to changes in interests there are two main approaches: the
window frame and the ageing mechanism. Maintaining interests in a window frame is a
solution that is beneficial for discovering and maintaining a list of recently introduced
interests because they appear fast and distinctively, as shown in Crabtree (1998). However, the
drawback of the window frame approach is that it is difficult to retrieve past interests.
Typically, if an interest changes or disappears, it is discarded. This has led to
experiments with optimised "interest forgetting functions" as disclosed in I Koychev,
Gradual forgetting for adaptation to concept drift, In ECAI 2000 Workshop Current Issues
in Spatio-Temporal Reasoning, pages 101 - 106, 2000. This method is a function that
decreases the influence of an interest in time; old interests gradually disappear as their
importance is reduced linearly over a period of time. The classification of the interests is a
crisp set that discards interests when the linear function of the "gradual forgetting" process
comes to term.
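
By way of illustration only, a linear forgetting function of the kind described above might be
sketched as follows. The function names and the horizon parameter are assumptions made for
this example and are not taken from the cited work.

```python
def linear_forgetting_weight(age_days: float, horizon_days: float) -> float:
    """Return an influence weight that falls linearly from 1 to 0 with age."""
    if age_days < 0 or horizon_days <= 0:
        raise ValueError("age must be non-negative and horizon must be positive")
    # Once the horizon is reached the weight is 0, i.e. a crisp cut-off.
    return max(0.0, 1.0 - age_days / horizon_days)


def surviving_interests(ages: dict[str, float], horizon_days: float) -> dict[str, float]:
    """Keep only the interests whose weight is still above zero."""
    weights = {name: linear_forgetting_weight(age, horizon_days) for name, age in ages.items()}
    return {name: w for name, w in weights.items() if w > 0.0}
```

An interest whose age reaches the horizon is dropped outright, which is the crisp behaviour
that the approach described above exhibits.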



In order to compensate for the large dimensionality of information retrieval it is known to
use user feedback in various forms such as the relevance feedback system disclosed in
JJ Rocchio, Performance Indices for Information Retrieval, Prentice Hall, 1971 or user
rating as disclosed in D Billsus, M Pazzani, Learning and revising user profiles: The
identification of interesting web sites, Machine Learning, 27:313 - 331, 1997. One problem
related to requiring feedback from users is that in practice users are reluctant to provide
any feedback regardless of how valuable it is to their future requests in the system. It
seems that users do not want to interact with the search engine once it has returned the
results since it is perceived as an annoyance rather than a benefit.

Embodiments of the invention aim to improve the performance of an on-line search engine
by gathering and maintaining user profiles obtained by analysing the documents that are
relevant to the users. The system builds and maintains user profiles in a two-fold process.
First, the system uses an algorithm as disclosed in A Nurnberger, Interactive text retrieval
supported by self-organising maps, Technical report, BTexact Technologies, IS Lab,
2002, to extract contextually related keywords from a set of documents. Secondly, the
keywords in the concepts are given attributes: a life span and a relevance value. The life
span indicates to the system when some words within a concept have not been found
relevant for some time and therefore should be reduced in importance or removed
altogether. The relevance value is a link between two keywords of a concept; this value
reflects the strength of the relation between the two keywords. The users have control
over these parameters. They can decide if words should have a long or a short life span,
and if the relation between keywords should be strong or weak before they can start
appearing in their profiles.

The solution proposed here also offers the users the facility to rebuild a query that is more
valuable based on their initial query and their profile. The interaction with the system is
performed before the documents are retrieved, when the users are more receptive to
further interaction with the system.

This application helps users maintain a profile of temporary interests. The system also
provides the analysis required to extract keywords that are relevant to help the users build
an efficient profile. The analysis is based on personal data and therefore the keywords
suggested to the users are all adapted to their profiles.

The system helps in maintaining profiles, allowing the users to have an informed control
over their profile. The system is able to identify the keywords and concepts that
the users need to improve their search. The profile obtained can be used for query
expansion. The users can decide if a keyword is negative or positive to their search.

Embodiments of the invention will now be described with reference to the accompanying
figures in which:

figure 1 is a schematic diagram representing the hardware architecture of an embodiment 
of the invention; 

figures 2a and 2b are screen shots of the user interface of an embodiment of the invention
showing the embodiment in use;



figure 3 is a schematic representation of the operation of an embodiment of the invention in
response to a user input;

figure 4 is a schematic diagram of the functional elements of the system;

figure 5 is a flow chart illustrating the embodiment of the invention processing data to
produce or maintain a list of user interests;

figure 6 is a schematic representation of the processing of the list of interests of figure 5
into a plurality of fuzzy sets.



With reference to figure 1, a conventional PC computer 101 is connected to a network 103
such as a wide area network (WAN) or, more specifically, the Internet. Another computer
105 may be connected to the WAN 103 via a Local Area Network (LAN) 107 coupled with
the access to a gateway server computer (not shown) that enables the computers 101
and 105 to communicate over the WAN 103. Alternatively, connection 107 could be provided via
home Internet access such as broadband and telephone line based access. The PC
101, referred to as the client machine, is arranged to access the
computer 105. The client machine 101 has software to be able to access the WAN 103.
The computer 101 has an operating system (e.g. Microsoft Windows™, Unix, or Linux)
and a web browser (e.g. Microsoft Internet Explorer™ or Netscape Navigator™).

An overview of the user interaction with the system will now be described with reference
to figures 2a and 2b. A user can enter a query into the system from a
web browser. In the example provided, the user enters the acronym for the British
Broadcasting Corporation, "BBC". A "Search" button 205 instructs the search engine to
process the entered query. In response to this the system returns a list 207 of alternative
keywords as shown in figure 2b. In this example the list of keywords 207 comprises the
acronyms for some alternative television companies, "Granada" and "ITV", as well as the
original entry of "BBC". The list of keywords 207 is provided to assist the users to perform a
better search. The user can select one or more of the keywords from the list 207 to refine
the query and then use the "Refine" button 209 to submit the query. The selection can be
either positive or negative, i.e. the keywords can be included in the query or specifically
excluded via alternative selection indicators 211.

As described above, the system returns the list 207 of alternative keywords prior to retrieving
the search results. Alternatively, the system may be arranged to return the results as
would be expected from a conventional search engine. Along with the set of results, the
application would return the list 207 of alternative keywords.
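
As a minimal sketch of how positive and negative keyword selections might be combined with
the initial query, the following example builds a Boolean query string. The function name and
the AND / AND NOT operator syntax are assumptions for illustration; the operators accepted
by a given search engine may differ.

```python
def build_refined_query(original: str, include: list[str], exclude: list[str]) -> str:
    """Combine the original query with positively and negatively selected keywords."""
    # Positive selections are added to the query (skipping a repeat of the original term).
    terms = [original] + [kw for kw in include if kw != original]
    query = " AND ".join(terms)
    # Negative selections are specifically excluded.
    for kw in exclude:
        query += f" AND NOT {kw}"
    return query


print(build_refined_query("BBC", include=["ITV"], exclude=["Granada"]))
```

With the inputs shown, the refined query is "BBC AND ITV AND NOT Granada".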


The process described above with reference to figures 2a & 2b is summarised in figure
3. The user 301 enters the query into the system 303 at step 305 and the system 303 then
accesses the user profile 307 for that user at step 309. The system then generates a list
of keywords from the profile 307 at step 311 and returns them to the user 301 at step 313
as described above with reference to figure 2b. The user makes their choice of refining
the search using the list 207 of keywords and the system executes the query or search at
step 315, taking into account the user's refinements, using the search engine 317 and the
database 319. The results are then displayed to the user at step 321 via the system front
end.
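
The keyword generation of step 311 might be sketched as below. This is an illustrative
assumption only: the profile is modelled as a list of keyword sets (one per interest), and the
lower-casing and alphabetical ordering are choices made for the example, not features of the
described system.

```python
def suggest_keywords(query: str, profile: list[set[str]]) -> list[str]:
    """Return the query term followed by keywords that share an interest with it."""
    term = query.strip().lower()
    suggestions = {term}
    for interest in profile:
        lowered = {kw.lower() for kw in interest}
        if term in lowered:
            suggestions |= lowered
    # The original entry comes first, then the related keywords alphabetically.
    return [term] + sorted(suggestions - {term})


profile = [{"bbc", "itv", "granada"}, {"java", "python"}]
print(suggest_keywords("BBC", profile))
```

A query term that matches no interest in the profile is simply returned on its own.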


With reference to figure 4, the core of the system is a profile manager 401 that operates in
two phases. The first phase uses a word group extraction system 403 to identify related
keywords from a repository of documents 405. The repository 405 is a set of documents
that are expected to reflect the users' interests. The extracted groups of related keywords
are representative of those interests of a given user. Each user of the system has a
document repository 405 which can be maintained either by the user or an automatic
document retriever (not shown). The processing of the contents of the repository 405 to
extract the related keywords may be performed off-line. The operation of the word group
extraction system 403 will be described further below. The second phase is the
classification of the related keywords or interests extracted using an interest classifier 407.
The interest classifier 407 uses a set of rules 409 to classify interests by their statistical
significance (importance) in the corpus of text in the repository 405 and by their age (life
span). The operation of the interest classifier 407 will be described further below.

The output of the profile manager 401 is a set of interests 411 classified by their
importance in the repository 405 and life span. The profile manager 401 then uses the set
of interests 411 in response to the input of a query 413 (203, 205 in figure 2a) to provide
the user with a list of keywords (207 in figure 2b). The management and maintenance of
the interests is carried out by the profile manager in accordance with a set of rules which
are described below. The management includes updating the interests from time to
time and removing old or outdated interests. The interests 411 are used to refine the
search as described above. The set of interests 411 may also be referred to as the user
profile. In some situations the profile may include other data describing the user's interests
and/or preferences. The profile manager 401 requires a set of interests 411 before it can
provide a list of keywords in response to a user query. As a result, the system needs to
go through a learning process while the set of interests is initially set up.

The process carried out by the profile manager 401 described above will now be
described in further detail with reference to the flow chart of figure 5. At step 501 the
profile manager 401 uses the word group extraction system 403 to identify contextually
related keywords within bodies of text in the repository 405. The word group extraction
system 403 uses a self-organising map (SOM) algorithm disclosed in T Kohonen, Self-organising
and associative memory, Springer-Verlag, 1984. The input to the SOM is word
triples (represented in a numerical format). The SOM produces a representation of the
input words in clusters on a conceptual two-dimensional map where strongly related
keywords appear close to one another. For example, if a, b, x and y are words that can be
found in a text corpus T, and the following two word arrangements are frequent across T: a x b
and a y b, then a and b are contextually related keywords.
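
The notion of contextual relatedness in the example above can be illustrated without a
self-organising map. The following sketch is a simple counting stand-in, not the SOM algorithm
itself: it treats two words as related when they enclose at least a given number of distinct
middle words across the word triples. The function name and the threshold parameter are
assumptions for this example.

```python
from collections import defaultdict


def related_pairs(triples: list[tuple[str, str, str]], min_contexts: int = 2) -> set[tuple[str, str]]:
    """Find (first, last) pairs seen around at least `min_contexts` distinct middle words."""
    middles: dict[tuple[str, str], set[str]] = defaultdict(set)
    for first, middle, last in triples:
        middles[(first, last)].add(middle)
    return {pair for pair, seen in middles.items() if len(seen) >= min_contexts}


triples = [("a", "x", "b"), ("a", "y", "b"), ("c", "x", "d")]
print(related_pairs(triples))
```

Here the pair (a, b) is reported because it occurs around both x and y, whereas (c, d) occurs
in only one arrangement.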

At step 503 the output of the SOM algorithm is extracted as a list of contextually related
keywords. The list is represented by a number N of items made of keywords A (a, b, c),
B (d, e, f), etc., where the upper case letters represent sets of related keywords or
interests and lower case letters simply represent keywords. The set of interests can be
seen as a personalised ontology: every keyword is associated with the keywords that are
statistically related to it.

Processing then moves to step 505 at which the profile manager 401 assigns each
interest an initial importance value and a life span value. The importance value is
set up as the average Inverse Document Frequency (IDF) value of every keyword of the
interest as disclosed in K Sparck Jones, Index term weighting, Information Storage and
Retrieval, 9:313 - 316, 1973. The IDF value of a given keyword reflects its statistical
importance in a given text corpus (in this case the user document repository 405). This
importance value is normalised so that the weight can be expressed as a percentage
value.
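
A minimal sketch of the importance calculation of step 505 is given below. It assumes the
common IDF form log(N / df) and assumes that normalisation is performed by scaling against
the largest average found; neither detail is specified above, and the function names are
hypothetical.

```python
import math


def idf(keyword: str, documents: list[set[str]]) -> float:
    """Inverse document frequency log(N / df), or 0.0 if the keyword never occurs."""
    df = sum(1 for doc in documents if keyword in doc)
    return math.log(len(documents) / df) if df else 0.0


def interest_importance(interests: dict[str, list[str]], documents: list[set[str]]) -> dict[str, float]:
    """Average the IDF of each interest's keywords, then scale to a 0-100 percentage."""
    # Each interest is assumed to contain at least one keyword.
    raw = {
        name: sum(idf(kw, documents) for kw in kws) / len(kws)
        for name, kws in interests.items()
    }
    top = max(raw.values(), default=0.0)
    return {name: (100.0 * value / top if top else 0.0) for name, value in raw.items()}
```

Under these assumptions the interest whose keywords are rarest in the repository receives
100 per cent and the others are expressed relative to it.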

Processing then moves to step 507 where the interest classifier 407 takes each interest in
turn and determines whether it is a new interest or an existing interest. If the interest is a
new interest processing moves to step 509.



At step 509 if the interest is the first interest for a new set of interests 411 then the profile
manager 401 creates a new set and the interest is added. If the interest is an addition to an
existing set 411 then it is simply added to the set 411.



If at step 507 the new interest is identified as an existing interest in the set 411 then
processing moves to step 513. At step 513 each keyword of the new interest is taken in
turn and if the keyword is part of the existing interest then its weight is increased by a
factor x. In the present embodiment the increase is linear and the factor is set to 1.3. If a
keyword in the new interest is not present in the existing interest then it is given a weight
of 1. Once each keyword in the new interest has been processed in this way the weights
are normalised and the system is able to express the weights as a value between 0 and 1.
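
The weight update of step 513 might be sketched as follows. The factor of 1.3 and the weight
of 1 for previously unseen keywords are taken from the description above; the normalisation
by the largest weight and the function name are assumptions for this example.

```python
def merge_interest(existing: dict[str, float], new_keywords: list[str], factor: float = 1.3) -> dict[str, float]:
    """Boost keywords seen again, add unseen ones with weight 1, then normalise to 0..1."""
    weights = dict(existing)
    for kw in dict.fromkeys(new_keywords):  # ignore repeated keywords, keep order
        if kw in weights:
            weights[kw] *= factor  # keyword is part of the existing interest
        else:
            weights[kw] = 1.0      # keyword is not present in the existing interest
    if not weights:
        return {}
    top = max(weights.values())
    return {kw: w / top for kw, w in weights.items()}


print(merge_interest({"bbc": 1.0, "itv": 0.5}, ["bbc", "granada"]))
```

After normalisation the most heavily weighted keyword has the value 1 and every other weight
lies between 0 and 1.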

At step 511 the profile manager 401 gives each interest a life span expressed in days. In the
present embodiment this is set to 60 days. A renewed interest is automatically reclassified
with a 60 day or full life span. The new or updated interests are then added to the set of
interests 411. The existing interest is then replaced with the new or updated interest in the
set of interests 411.


Once the profile manager 401 has produced or updated a set of interests 411 it then
utilises the interest classifier 407 to process the interests 411 further. With reference to
figure 6, the input into the interest classifier is the set of interests 411 and the set of rules
409. The interest classifier 407 outputs the set of interests classified into two fuzzy sets
501, 503. Every interest is classified into one of the three life span fuzzy sets 503a, 503b,
503c and into one of the three importance weight fuzzy sets 501a, 501b, 501c. The
classification of each interest depends on the life span and importance weights assigned
to each interest in steps 505, 509, 511 and/or 513 of figure 5 as described above.

As noted above, an interest is given an initial life span (step 511 in figure 5) and is
classed into one of three fuzzy sets by the interest classifier 407. If the initial
classification is "long" the interest will be sustained in the system for at least as long as
the system is initially set up to (sixty days in the current implementation). This
classification is reviewed on a regular basis by the fuzzy engine, such as when concepts
are updated or added. If the interest is not renewed its life span will result in a gradual
downgrading to the "average" set, then to the "short" set and finally it will be removed from
the set of interests 411. In other words, the classification of an interest into a life span
fuzzy set is an indication of its life span expectancy in the system.
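
One possible way of classifying an interest into the "short", "average" and "long" life span
sets is sketched below. The triangular membership functions, their break points and the
function names are assumptions chosen for illustration; the description above does not
specify the shape of the fuzzy sets.

```python
from typing import Optional


def life_span_memberships(days_left: float, full_span: float = 60.0) -> dict[str, float]:
    """Degree of membership of an interest in the short, average and long sets."""
    x = max(0.0, min(days_left, full_span)) / full_span  # position in [0, 1]
    return {
        "short": max(0.0, 1.0 - 2.0 * x),               # 1 at x = 0, 0 from x = 0.5
        "average": max(0.0, 1.0 - abs(2.0 * x - 1.0)),  # peak at x = 0.5
        "long": max(0.0, 2.0 * x - 1.0),                # 0 up to x = 0.5, 1 at x = 1
    }


def classify_life_span(days_left: float, full_span: float = 60.0) -> Optional[str]:
    """Return the best-matching set, or None when the life span has expired."""
    if days_left <= 0:
        return None  # the interest would be removed from the set of interests
    degrees = life_span_memberships(days_left, full_span)
    return max(degrees, key=degrees.get)
```

Under these assumptions an interest that is never renewed moves from "long" through
"average" to "short" as its remaining life span falls, and is removed when it reaches zero.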



The users may have access to the fuzzy sets configuration through an interface to enable
them to adjust the classification. The users can modify the size of the life span
sets 503a, 503b, 503c and thus modify the life span of concepts. To keep concepts longer
the fuzzy set of recent concepts 503a is increased and the sizes of one or more of the
sets of older concepts 503b, 503c reduced.

The importance fuzzy sets 501a, 501b, 501c are used in the selection of keywords that
will be suggested to a user in response to the entry of a query. For example, the system
may be arranged to suggest only strong interests, strong and medium interests or all
interests. Again the users can decide on the size of these data sets so that they have
control over the selection process. Similarly the system 401 is arranged so that if the system
is about to discard a concept with strong relevance (because its life span has expired) the
system can require confirmation from the user. This gives the user the facility to renew the
life span of the interest if they choose.
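
The selection and confirmation behaviour described above might be sketched as follows. The
mode names, the "high" / "medium" / "low" labels used as dictionary values and the function
names are assumptions for this example, and each interest is assumed to have already been
assigned to a single importance set.

```python
def select_suggestions(levels: dict[str, str], mode: str = "strong") -> list[str]:
    """Return keywords whose importance set is allowed by the chosen mode."""
    allowed_by_mode = {
        "strong": {"high"},
        "strong+medium": {"high", "medium"},
        "all": {"high", "medium", "low"},
    }
    allowed = allowed_by_mode[mode]
    return sorted(kw for kw, level in levels.items() if level in allowed)


def needs_confirmation(level: str, days_left: float) -> bool:
    """True when an interest whose life span has expired is still strongly relevant."""
    return level == "high" and days_left <= 0
```

A caller would check needs_confirmation before discarding an expired interest, so that the
user can renew its life span.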

Interests that have had their importance value renewed (step 513 of figure 5) may well
remain in the same fuzzy set or they may be upgraded. Others that have not been
renewed may either be sustained a little longer in the same set or they may be
downgraded. An interest with an updated importance value is not automatically
reclassified in the "high" fuzzy set; others are gradually downgraded to the "medium" and
the "low" sets.



The system is designed to help the users manage their profile efficiently. Yet, the system
can run without requiring the users to maintain anything. Users are also allowed to add,
change, and remove concepts. They can thoroughly control their sets of interests 411,
repositories 405 and rules 409. The system provides a non-obtrusive software application.
The application gradually builds fuzzy sets of keywords and is able to make helpful
suggestions to the users. By giving control to the users with regard to the size of the
fuzzy sets they can manage the maintenance of the profiles and they can build more
efficient queries.

Self organising maps are discussed further in T Kohonen, Self-organized formation of
topologically correct feature maps, Biological Cybernetics, 43:59-69, 1982; and H Ritter
and T Kohonen, Self-organising semantic maps, Biological Cybernetics, 61(4):241 - 254,
1989.



It will be understood by those skilled in the art that the apparatus that embodies the
invention could be a general purpose device having software arranged to provide an
embodiment of the invention. The device could be a single device or a group of devices
and the software could be a single program or a set of programs. Furthermore, any or all
of the software used to implement the invention can be contained on various transmission
and/or storage mediums such as a floppy disc, CD-ROM, or magnetic tape so that the
program can be loaded onto one or more general purpose devices or could be
downloaded over a network using a suitable transmission medium.

Unless the context clearly requires otherwise, throughout the description and the claims
the words "comprise", "comprising" and the like are to be construed in an inclusive as
opposed to an exclusive or exhaustive sense; that is to say, in the sense of "including but
not limited to".

User Mr Smith's action is sending query

1: User sends query
2: User "related keyword" profile is read and additional Boolean keywords are extracted
3: The user is given the Boolean terms to expand his query
4: The expanded query can go to a search engine (one that can handle Boolean querying)
5: The set of results is extracted from the data repository
6: The Graphical User Interface displays the results back to the user

Figure 3

Extract keyword association given by the self-organising map algorithm

Calculate the importance weight of every keyword pair association + normalise result. The
output is a value between 0 and 1

Add interest to set of interests and initiate importance value

Calculate new importance weight

Update life span value